If you recall back in the very first lecture, I introduced the concept of tidy data where each row in a dataset is an observation and each column is a variable. Tidy data is typically longer than it is wider.
For example, you could represent data on minimum wage by province and year like this.
| Province | 2019 | 2020 | 2021 | 2022 | ||
|---|---|---|---|---|---|---|
| Alberta | 14 | 15 | 16 | 17 | ||
| Ontario | 14.75 | 15 | 16.25 | 17 | ||
| Quebec | 14.25 | 15.5 | 16 | 17.5 |
This is a very common way to present and read data and it works great when you are presenting data in tabular form because we read from left to right.
However, to visualize complex forms of data, a tidy, or longer,
structure is useful and necessary for using
ggplot2.
The same dataset above looks like this when tidied.
| Province | Year | Wage |
|---|---|---|
| Alberta | 2019 | 14 |
| Alberta | 2020 | 15 |
| Alberta | 2021 | 16 |
| Alberta | 2022 | 17 |
| Ontario | 2019 | 14.75 |
| Ontario | 2020 | 15 |
| Ontario | 2021 | 16.25 |
| Ontario | 2022 | 17 |
| Quebec | 2019 | 14.25 |
| Quebec | 2020 | 15.5 |
| Quebec | 2021 | 16 |
| Quebec | 2022 | 17.5 |
We have already done some conceptual work early in the class distinguishing between observations and variables. This class is focussed more on the tactics to reshape data in different forms in R.
We’re going to be working with some data on unemployment rates by
year from Statistics Canada. The data are contained in = a csv file in
the data subfolder which is pretty straightforward to read.
But there are three problems:
Percent stuck below the
headersAll of these problems can be dealth with using the skip
and the n_max functions.
Because we can see easily that the columns in this case run from 2017
to 2021, maybe we can skip importing the headers and that
percentage row and we’ll just make our own names. That will
solvce problem 1 and 2. And if we look down at the end, maybe we want to
stop at the Unclassified row. It looks like there are just
missing data anyway. So let’s just import rows 15 to 42.
Note, we will need the following libraries for the code to follow. Run this code to install these packages if they are not already installed.
install.packages("tidyverse")
install.packages("haven")
install.packages("here")
install.packages("knitr")
Once installed, they can be loaded with this code.
library(haven)
library(tidyverse)
library(here)
library(knitr)
unemployment<-read.csv(here("data/1410002301-eng.csv"), skip=14, nrows=42-15, header=F)
Let’s take a look:
View(unemployment)
So that worked well. Note that in the column there are some junky numbers at the end of the name of each sector. This would require a manipulation which we can talk about next class. For now we will leave it.
Let’s take another look with glimpse()
glimpse(unemployment)
## Rows: 27
## Columns: 8
## $ V1 <chr> "Total, all industries 5", "Goods-producing sector 6", "Agriculture…
## $ V2 <dbl> 6.4, 5.6, 4.4, 8.0, 11.2, 27.3, 5.6, 2.1, 7.7, 3.7, 3.4, 4.1, 3.5, …
## $ V3 <dbl> 5.9, 5.0, 4.0, 6.9, 10.9, 22.8, 4.8, 1.6, 6.6, 3.6, 3.3, 4.1, 3.3, …
## $ V4 <dbl> 5.7, 5.1, 4.4, 7.4, 11.3, 28.2, 4.9, 1.7, 6.8, 3.7, 3.4, 4.1, 3.3, …
## $ V5 <dbl> 9.5, 8.0, 5.1, 10.9, 10.6, 22.5, 9.9, 1.9, 9.7, 7.1, 7.4, 6.6, 7.1,…
## $ V6 <dbl> 7.5, 5.1, 4.7, 7.2, 9.1, 25.4, 5.2, 1.6, 5.8, 4.5, 4.4, 4.7, 3.9, 4…
## $ V7 <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ V8 <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
The variable names here are not very helpful and there are two columns filled with missing values. So we need to assign some names to these columns and we need to drop the last two columns which somehow got imported even though they are not necessary.
If you remember from last class, when we use the
select() command we can also rename the variables we are
selecting. So why don’t we just select the first five columns, renaming
each column?
unemployment %>%
select(Sector=1, `2017`=2, `2018`=3, `2019`=4, `2020`=5)->unemployment
unemployment
## Sector 2017 2018 2019 2020
## 1 Total, all industries 5 6.4 5.9 5.7 9.5
## 2 Goods-producing sector 6 5.6 5.0 5.1 8.0
## 3 Agriculture 7 4.4 4.0 4.4 5.1
## 4 Forestry, fishing, mining, quarrying, oil and gas 8 9 8.0 6.9 7.4 10.9
## 5 Forestry and logging and support activities for forestry 11.2 10.9 11.3 10.6
## 6 Fishing, hunting and trapping 27.3 22.8 28.2 22.5
## 7 Mining, quarrying, and oil and gas extraction 8 5.6 4.8 4.9 9.9
## 8 Utilities 2.1 1.6 1.7 1.9
## 9 Construction 7.7 6.6 6.8 9.7
## 10 Manufacturing 3.7 3.6 3.7 7.1
## 11 Durables 3.4 3.3 3.4 7.4
## 12 Non-durables 4.1 4.1 4.1 6.6
## 13 Services-producing sector 10 3.5 3.3 3.3 7.1
## 14 Wholesale and retail trade 3.8 3.8 3.7 7.6
## 15 Wholesale trade 3.4 2.7 2.5 5.7
## 16 Retail trade 4.0 4.1 4.1 8.2
## 17 Transportation and warehousing 3.1 3.4 3.2 7.5
## 18 Finance, insurance, real estate, rental and leasing 2.3 1.9 1.6 3.3
## 19 Finance and insurance 2.3 1.8 1.6 2.6
## 20 Real estate and rental and leasing 2.1 2.1 1.8 5.0
## 21 Professional, scientific and technical services 2.8 2.6 2.7 5.0
## 22 Business, building and other support services 11 7.0 5.7 6.2 10.0
## 23 Educational services 3.5 3.5 3.7 6.7
## 24 Health care and social assistance 1.7 1.7 1.5 3.5
## 25 Information, culture and recreation 5.5 5.0 5.6 13.1
## 26 Accommodation and food services 5.8 5.4 5.6 17.9
## 27 Other services (except public administration) 3.8 2.9 2.8 7.9
This is classically wide data and it is great for presenting
information for tables, primarily because, in English, we read
information from left to right. But for other purposes, it is not great.
Another way to think about this is is that 2017 is not really a
variable. Year is the variable and in the time we observe
the unemployment rate, the variable of year takes on
another value. So those years should be in one column and the
unemployment rate should be in a separate column.
So we need to pivot the columns from 2017 to 2020 to
make two new columns: a Year column and an
Unemployment Rate column. In order to do this we need to
specify which columns we want to pivot down.
unemployment %>%
pivot_longer(., cols=`2017`:`2020`)
## # A tibble: 108 × 3
## Sector name value
## <chr> <chr> <dbl>
## 1 Total, all industries 5 2017 6.4
## 2 Total, all industries 5 2018 5.9
## 3 Total, all industries 5 2019 5.7
## 4 Total, all industries 5 2020 9.5
## 5 Goods-producing sector 6 2017 5.6
## 6 Goods-producing sector 6 2018 5
## 7 Goods-producing sector 6 2019 5.1
## 8 Goods-producing sector 6 2020 8
## 9 Agriculture 7 2017 4.4
## 10 Agriculture 7 2018 4
## # … with 98 more rows
If you remember back in the selection on select() there
are a bunch of helpers we can use to pick and choose the columns we want
to pivot. Sometimes they are out of order; sometimes the names are long
and messy; Sometimes they have a common prefix. So, here, we can also
select by position. We want the 2 through 5th columns.
unemployment %>%
pivot_longer(., cols=2:5)
## # A tibble: 108 × 3
## Sector name value
## <chr> <chr> <dbl>
## 1 Total, all industries 5 2017 6.4
## 2 Total, all industries 5 2018 5.9
## 3 Total, all industries 5 2019 5.7
## 4 Total, all industries 5 2020 9.5
## 5 Goods-producing sector 6 2017 5.6
## 6 Goods-producing sector 6 2018 5
## 7 Goods-producing sector 6 2019 5.1
## 8 Goods-producing sector 6 2020 8
## 9 Agriculture 7 2017 4.4
## 10 Agriculture 7 2018 4
## # … with 98 more rows
Notice that the two new columns don’t have really helpful names. By
default, pivot_longer() names the two new columns
variable and value. But we can change these to
suit our purposes. Here we use the names_to and the
values_to argument. So one column will contain the
names of the pivoted columns and one column will contain the
values.
unemployment %>%
pivot_longer(., cols=`2017`:`2020`, names_to=c("Year"), values_to=c("Unemployment Rate"))->unemployment
head(unemployment)
## # A tibble: 6 × 3
## Sector Year `Unemployment Rate`
## <chr> <chr> <dbl>
## 1 Total, all industries 5 2017 6.4
## 2 Total, all industries 5 2018 5.9
## 3 Total, all industries 5 2019 5.7
## 4 Total, all industries 5 2020 9.5
## 5 Goods-producing sector 6 2017 5.6
## 6 Goods-producing sector 6 2018 5
Notice we have here a perfectly tidy data frame ready for graphing. The x can be the year, the y can be the rate and the color of the line could be the sector.
unemployment %>%
ggplot(., aes(x=Year, y=`Unemployment Rate`, col=Sector, group=Sector))+geom_line()
Obviously we would want to make it more readable, but that will come in the next few classes.
Sometimes, even when data is properly tidy in that every
variable is in a column, we still want to pivot it longer in order to
facilitate visualazation. The reason is the entire philosophy of
ggplot2 is to link any graphical element that varies with
information to one column.
An example is this dataset that contains the immunization coverage rates for DTP and MMR in Toronto schools here.
The data are stored in the file:
immunication-coverage-2017-2018.csv in the
data subfolder.
immunization<-read.csv(file=here("data/immunization-coverage-2017-2018.csv"))
glimpse(immunization)
## Rows: 808
## Columns: 9
## $ X_id <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12…
## $ School.Name <chr> "A Y JACKSON SECONDARY SCHOOL", "ACAD…
## $ Enrolled.population <int> 1070, 110, 247, 66, 1253, 109, 180, 1…
## $ DTP.coverage.rate.... <dbl> 91.0, 88.2, 91.9, 71.2, 90.7, 98.2, 9…
## $ DTP.Religious.exemption.rate.... <dbl> 1.0, 1.8, 3.6, 19.7, 1.0, 0.0, 0.0, 0…
## $ MMR.coverage.rate.... <dbl> 95.7, 90.0, 94.3, 72.7, 97.8, 98.2, 9…
## $ MMR.Religious.exemption.rate.... <dbl> 1.0, 1.8, 3.6, 19.7, 1.0, 0.0, 0.0, 0…
## $ Lat <dbl> 43.80526, 43.76242, 43.68315, 43.7454…
## $ Lng <dbl> -79.36656, -79.17977, -79.28849, -79.…
Let’s clean the names of the data-set.
library(janitor)
immunization<-clean_names(immunization)
Notice here we have four different columns: the DTP coverage rate, the DTP Religious Exemption Rate, the MMR coverage rate and the MMR Religious Exemption rate.
How many variables ar ethere actually in these columns?
In reality, I would say there are three variables:
It might make sense to compare MMR coverage with MMR exemptions for each school and then DTP Coverage with DTP Exemptions for each school. Or it might make sense to compare the coverages of both with the religious exemptions of both. We can actually do a lot of combinations here.
Turning this into a proper tidy data frame is a bit more tricky, but we can do it.
The first thing to do is to take the four columns that have the
measurements into one column. Let’s call the names_to
column Measure and the values_to column
Value. Remember, we can select variables in
multiple ways.
We can select by position i.e. the fourth through the seventh column
(I got that literally just by counting the columns in a
View or a head() call.). We’ll do that, store
the object and then also just show other ways to do it without saving
the results.
immunization %>%
pivot_longer(., cols=4:7, names_to=c("Measure"), values_to=c("Value")) ->immunization_long
We could do it by specifying the names of the variables to serve as the beginning and the ending:
immunization %>%
pivot_longer(., cols=dtp_coverage_rate:mmr_religious_exemption_rate, names_to=c("Measure"), values_to=c("Value"))
## # A tibble: 3,232 × 7
## x_id school_name enrolled_pop…¹ lat lng Measure Value
## <int> <chr> <int> <dbl> <dbl> <chr> <dbl>
## 1 1 A Y JACKSON SECONDARY SCHOOL 1070 43.8 -79.4 dtp_co… 91
## 2 1 A Y JACKSON SECONDARY SCHOOL 1070 43.8 -79.4 dtp_re… 1
## 3 1 A Y JACKSON SECONDARY SCHOOL 1070 43.8 -79.4 mmr_co… 95.7
## 4 1 A Y JACKSON SECONDARY SCHOOL 1070 43.8 -79.4 mmr_re… 1
## 5 2 ACADEMIE ALEXANDRE-DUMAS 110 43.8 -79.2 dtp_co… 88.2
## 6 2 ACADEMIE ALEXANDRE-DUMAS 110 43.8 -79.2 dtp_re… 1.8
## 7 2 ACADEMIE ALEXANDRE-DUMAS 110 43.8 -79.2 mmr_co… 90
## 8 2 ACADEMIE ALEXANDRE-DUMAS 110 43.8 -79.2 mmr_re… 1.8
## 9 3 ADAM BECK JUNIOR PUBLIC SCHOOL 247 43.7 -79.3 dtp_co… 91.9
## 10 3 ADAM BECK JUNIOR PUBLIC SCHOOL 247 43.7 -79.3 dtp_re… 3.6
## # … with 3,222 more rows, and abbreviated variable name ¹​enrolled_population
We could also get really logical and pick out keywords that mark off
our selections. In this case, we could say, if the variables contain
dtp or mmr, then pivot them down.
immunization %>%
pivot_longer(., cols=contains("mmr")|contains("dpt"), names_to=c("Measure"), values_to=c("Value"))
## # A tibble: 1,616 × 9
## x_id school_name enrol…¹ dtp_c…² dtp_r…³ lat lng Measure Value
## <int> <chr> <int> <dbl> <dbl> <dbl> <dbl> <chr> <dbl>
## 1 1 A Y JACKSON SECONDAR… 1070 91 1 43.8 -79.4 mmr_co… 95.7
## 2 1 A Y JACKSON SECONDAR… 1070 91 1 43.8 -79.4 mmr_re… 1
## 3 2 ACADEMIE ALEXANDRE-D… 110 88.2 1.8 43.8 -79.2 mmr_co… 90
## 4 2 ACADEMIE ALEXANDRE-D… 110 88.2 1.8 43.8 -79.2 mmr_re… 1.8
## 5 3 ADAM BECK JUNIOR PUB… 247 91.9 3.6 43.7 -79.3 mmr_co… 94.3
## 6 3 ADAM BECK JUNIOR PUB… 247 91.9 3.6 43.7 -79.3 mmr_re… 3.6
## 7 4 AFRICENTRIC ALTERNAT… 66 71.2 19.7 43.7 -79.5 mmr_co… 72.7
## 8 4 AFRICENTRIC ALTERNAT… 66 71.2 19.7 43.7 -79.5 mmr_re… 19.7
## 9 5 AGINCOURT COLLEGIATE… 1253 90.7 1 43.8 -79.3 mmr_co… 97.8
## 10 5 AGINCOURT COLLEGIATE… 1253 90.7 1 43.8 -79.3 mmr_re… 1
## # … with 1,606 more rows, and abbreviated variable names ¹​enrolled_population,
## # ²​dtp_coverage_rate, ³​dtp_religious_exemption_rate
Now let’s take a look at the reshaped data.
View(immunization_long)
A long version of the immunization dataset
That looks good, but notice that in our new Measure
variable there are actually two variables tucked in there!
Disease and Coverage versus Exemption! It would be useful to separate
our Measure variable into two: one column for disease
(e.g. dtp versus mmr) and the other column
that indicates whether the rate in the value column might be the
coverage rate or the exemption rate.
dplyr offers a function called separate()
that does just that. It’s basically a version of Microsoft Excel’s
text to columns feature.
?separate
The basic syntax is
separate(data, col="columntoseparate", sep="symbol_to_separate_at", into=c("name_of_new_column" , "name_of_new_column2"))
If we look at the variable Measure we see that we could
use the _ to separate out the bits of information. Let’s
try this:
immunization_long %>%
separate(., col="Measure", sep="_", into=c("Disease", "Measure"))->immunization_long
We do get some warnings, but if we look at it, it basically looks pretty good.
View(immunization_long)
The results of separate.
That looks quite good. In the real world, I would want to manipulate this to capitalize the terms and make them look nice (as we will practice next week). But for now we will just carry on.
Now, the beauty of this is that we can group the diseases and measurements any way we want.
immunization_long %>%
ggplot(., aes(x=Value, y=fct_reorder(school_name, Value), fill=Disease))+geom_col(position="dodge")+facet_grid(~Measure)
Obviously this is completely unreadable: but we can do a few things to show what we can do. One is we can stop printing the school names, and turn this into an interactive plot where a reader only receives the schoool names when hovering the click over a bar.
We need the plotly library.
#install.packages('plotly')
library(plotly)
We make our graph, exactly as above: and save it in
plot1. Then run ggplotly on
plot1
immunization_long %>%
ggplot(., aes(x=Value, y=fct_reorder2(school_name, Measure, Value, .desc=F), fill=Disease))+geom_col(position="dodge")+facet_grid(~Measure)+theme(axis.text.y=element_blank())->plot1
ggplotly(plot1)
This is also not really sufficient, but what I want to show you is how we could also compare the coverage and the exemptions by putting the disease, side-by-side.
immunization_long %>%
ggplot(., aes(x=Value, y=fct_reorder2(school_name, Disease, Value, .desc=F), fill=Measure))+geom_col(position="dodge")+facet_grid(~Disease)+theme(axis.text.y=element_blank())->plot2
ggplotly(plot2)
Sometimes however, we need to go the other way. Statistics Canada often provides data on both raw numbers and percentages and it looks like this.
unemployment_long<-read.csv(file=here("data/unemployment_long.csv.csv"))
# Clean the names
unemployment_long<-clean_names(unemployment_long)
This is a data-set of unemployment rates and employment
levels (e..g the number of people employed in a sector) for
2017 to 2021 in Canada. On the one hand, the value column
helpfully includes one
If we look at it and scroll down to the bottom you will see the problem here.
View(unemployment_long)
A long dataset of unemployment rates in Canada.
We need to turn this long data into wide data with one column for unemployment and one column for employment.
names(unemployment_long)
## [1] "ref_date"
## [2] "geo"
## [3] "dguid"
## [4] "labour_force_characteristics"
## [5] "north_american_industry_classification_system_naics"
## [6] "sex"
## [7] "age_group"
## [8] "uom"
## [9] "uom_id"
## [10] "scalar_factor"
## [11] "scalar_id"
## [12] "vector"
## [13] "coordinate"
## [14] "value"
## [15] "status"
## [16] "symbol"
## [17] "terminated"
## [18] "decimals"
We need to specify that we want to take the names of the new column
from the labour_force_characteristics variable and the
values of the new columns have to come from the value
column.
unemployment_long %>%
pivot_wider(., names_from=c("labour_force_characteristics"), values_from=c("value"))->unemployment_wide
View(unemployment_wide)
Problem with the wide dataset
The problem is that the variable
labour_force_characteristics and the variable
uom (Unit of Measurement) are basically duplicate values.
labour_force_characteristics has the values
Employment which measures the number of people employed and
the Unemployment Rate which measures the …unemployment
rate. But the other variable uom has exactly the same
information.
unemployment_long %>%
select(labour_force_characteristics, uom) %>%
slice_sample(n=20)
## labour_force_characteristics uom
## 1 Unemployment rate Percentage
## 2 Unemployment rate Percentage
## 3 Unemployment rate Percentage
## 4 Unemployment rate Percentage
## 5 Employment Persons
## 6 Employment Persons
## 7 Employment Persons
## 8 Employment Persons
## 9 Employment Persons
## 10 Employment Persons
## 11 Employment Persons
## 12 Employment Persons
## 13 Employment Persons
## 14 Unemployment rate Percentage
## 15 Unemployment rate Percentage
## 16 Unemployment rate Percentage
## 17 Employment Persons
## 18 Employment Persons
## 19 Employment Persons
## 20 Employment Persons
So, we end up with a variable called Employment and a
variable called Unemployment Rate but there is also a
variable called UOM and it has entries for both the new
variables.
The quick and dirty fix for this is to just select the bare minimum of variables you need. In this way, we’ll avoid duplication.
unemployment_long %>%
select(ref_date, labour_force_characteristics, north_american_industry_classification_system_naics, age_group, value) %>%
pivot_wider(., names_from=c("labour_force_characteristics"), values_from=c("value"))->unemployment_wide
Now take a look.
View(unemployment_wide)
A wide dataset
now, we could use either Employment or
Unemployment Rate as a y variable depending on what we
wanted.
unemployment_wide %>%
ggplot(., aes(x=ref_date, y=Employment, col=north_american_industry_classification_system_naics))+geom_line()
Or we could pick the unemployment rate.
unemployment_wide %>%
ggplot(., aes(x=ref_date, y=`Unemployment rate`, col=north_american_industry_classification_system_naics))+geom_line()